


EXECUTIVE SUMMARY 



The primary purpose of this research was to examine the effects of various
input string lengths and error correction methods on the recognition
accuracy and efficiency of a currently available continuous automatic
speech recognition (ASR) system. The effect of sex was also examined, and
an estimate of the average recognition accuracy of a continuous ASR system
was sought.

In the entry of numerical data, the input string length of seven digits at 
a time proved significantly more efficient than strings of three or five.
Although subjects preferred some error correction methods over others, 
there were no significant differences in error rates or efficiency due to 
the correction method used. There were also no significant differences due 
to sex. 

The average recognition accuracy of the continuous ASR system was 
conservatively estimated at over 95%. These findings and areas of possible
future research are discussed. 
1. INTRODUCTION



1.1 Background

In recent years, voice technology has developed to the extent that basic 
systems have now been used successfully in several industrial and military
applications. Voice recognition devices that have been installed in "real 
world" situations have reduced input errors, cut task time, increased user 
friendliness, and proven cost effective in general (Nye, lybif; Poock, 
lybi?). This successful climate, along with continued reductions in the 
cost of voice recognition systems, has made voice input an attractive 
alternative to motor input in a wide variety of settings. 

Until recently, reliable ASR has been confined to recognition of discrete
speech, that is, utterances of up to about two seconds in length and pauses
of about 150 ms between utterances. With the advent of continuous ASR
systems the interactive process of ASR may be faster and more natural,
increasing the efficiency of ASR and its potential applications. However,
as with any new technology, new questions and issues need to be addressed.
The most basic issue in continuous ASR (and ASR in general) is system
efficiency. In effect, what input speed and accuracy can be expected in
the operation of a continuous ASR system? As is the case with discrete
ASR, the answer to this question can vary widely from one application to
another. In particular, the type of vocabulary can significantly affect
recognition accuracy (Armstrong and Poock, 1981). The vocabulary
consisting of digits, zero through nine, warrants special attention due to
its frequency of use across applications. Therefore, a study concerning
numerical data entry via continuous ASR should prove most useful in terms
of measuring the baseline recognition accuracy of the system and in
generalization of the results to other applications.
1.2 Problem



In the context of discrete ASR an investigation of numerical data entry
would be fairly straightforward. For example, a discrete digit is spoken
and the ASR system displays feedback of the match or mismatch. In the case 
of a mismatch the speaker would immediately cancel the error with some key 
word like "erase" or "rubout," and then try again. System efficiency would 
be measured in terms of average input speed and accuracy. 

With the capabilities of continuous ASR the investigation becomes somewhat 
more complex. The first issue concerns the number of digits to constitute 
an input — since a truly continuous ASR system could accept any number of 
digits, from one to infinity, as a single input for which it is to produce 
a matching set. Assuming a fixed number of total digits (e.g., 50) will be
input, different individual input string lengths may result in different
speed and accuracy rates. The input of 25 two-digit strings would require
24 inter-string pauses for recognition and feedback, compared to only four
such pauses with the input of five ten-digit strings.

Coarticulation may also be a factor in string length. Coarticulation is
the simultaneous pronunciation of the end of one word and the beginning of
another, e.g., "three-eight." The input of a two-digit string requires the
ASR system to deal with only one coarticulation in finding the boundary
between the two words. However, a ten-digit string requires the processing
of 9 coarticulations.
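The pause and coarticulation counts above follow from simple arithmetic. The sketch below is only an illustration of that arithmetic; the function name and dictionary keys are hypothetical, and it assumes the total number of digits divides evenly into strings.

```python
def string_entry_counts(total_digits: int, string_length: int) -> dict:
    """Count strings, inter-string pauses, and word boundaries for one string length."""
    if total_digits % string_length != 0:
        raise ValueError("total_digits must be a multiple of string_length")
    strings = total_digits // string_length
    return {
        "strings": strings,
        # One pause falls between each pair of consecutive strings.
        "inter_string_pauses": strings - 1,
        # Each adjacent pair of digits inside a string is one coarticulation.
        "boundaries_per_string": string_length - 1,
        "total_boundaries": strings * (string_length - 1),
    }


if __name__ == "__main__":
    for length in (2, 10):
        print(length, string_entry_counts(50, length))
```

With 50 digits, two-digit strings give 24 inter-string pauses and one boundary per string, while ten-digit strings give 4 pauses and 9 boundaries per string, so shorter strings trade fewer coarticulations for more pauses.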

Without a specific application in mind the determination of input string 
lengths for investigation becomes somewhat arbitrary. However, a speaker's 
short term memory for digits should be about seven, give or take a couple 
(Miller, 1956), placing a limit on the number of digits he or she can
comfortably remember for input and mentally compare to output. Based on
this assumption and practical considerations, the input string lengths of
three, five, and seven were chosen for investigation.

Another issue is error correction. In the discrete ASR of digits each
output is either 100% right or 100% wrong. However, with continuous ASR an
output may be partially incorrect. For example, the input string is "1, 4,
3, 5, 2" and the output is "1, 4, 3, 9, 2." These errors could be handled
like discrete ASR errors, in which case the speaker would issue the "ERASE"
or "RUBOUT" command and try again. However, other methods of correction
that address only the incorrect portion of the output may be more
efficient. In the example above it may be faster to change the 9 to a 5
than to erase the entire output and repeat the whole string again. In
addition, addressing only the specific error (changing the 9 to a 5) gives
the ASR system a different speech input to correct the error (e.g., "CHANGE
THE 9 TO A FIVE") rather than the same speech input ("1, 4, 3, 5, 2") which
has already demonstrated a propensity for misrecognition.

The question then, is what are the alternative correction methods for 
partial errors? The possibilities are limited only by one's imagination 
and degree of control over the feedback display. Four error correction 
methods were chosen for use in the experiment: 

1) "RUBOUT" - erases the entire output regardless of
   partial or total error.

   E.g., Input  = 1, 4, 3, 5, 2
         Output = 1, 4, 3, 9, 2
         Subject says "RUBOUT," "1, 4, 3, 5, 2"

2) "POSITION X MAKE-IT Y" - changes Xth digit
   (from left to right) to Y.

   E.g., Input  = 1, 4, 3, 5, 2
         Output = 1, 4, 3, 9, 2
         Subject says "POSITION 4 MAKE-IT 5"

3) "BACKUP X MAKE-IT Y" - changes Xth digit
   (from right to left) to Y.

   E.g., Input  = 1, 4, 3, 5, 2
         Output = 1, 4, 3, 9, 2
         Subject says "BACKUP 2 MAKE-IT 5"

4) "CHANGE X (nth ONE) MAKE-IT Y" - changes the nth X
   to Y; if n is not stated then the first X
   (from left to right) is changed to Y.

   E.g., Input  = 1, 4, 3, 5, 2
         Output = 1, 4, 3, 9, 2
         Subject says "CHANGE 9 MAKE-IT 5"

   E.g., Input  = 1, 9, 3, 5, 2
         Output = 1, 9, 3, 9, 2
         Subject says "CHANGE 9 SECOND-ONE MAKE-IT 5"

NOTE: Subjects use continuous speech to say everything in the examples
above.
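The four correction methods can be sketched as operations on a list of digits. This is a hypothetical illustration of the behavior described above, not the Verbex implementation; the function names and the choice to model RUBOUT as returning an empty list are assumptions.

```python
def rubout(output):
    """RUBOUT: discard the whole output; the subject then repeats the string."""
    return []


def position(output, x, y):
    """POSITION X MAKE-IT Y: set the x-th digit, counted from the left, to y."""
    fixed = list(output)
    fixed[x - 1] = y
    return fixed


def backup(output, x, y):
    """BACKUP X MAKE-IT Y: set the x-th digit, counted from the right, to y."""
    fixed = list(output)
    fixed[-x] = y
    return fixed


def change(output, x, y, nth=1):
    """CHANGE X (nth ONE) MAKE-IT Y: set the nth occurrence of digit x to y."""
    fixed = list(output)
    seen = 0
    for i, digit in enumerate(fixed):
        if digit == x:
            seen += 1
            if seen == nth:
                fixed[i] = y
                return fixed
    raise ValueError(f"digit {x} does not occur {nth} time(s)")


if __name__ == "__main__":
    target = [1, 4, 3, 5, 2]
    output = [1, 4, 3, 9, 2]
    print(position(output, 4, 5) == target)   # POSITION 4 MAKE-IT 5
    print(backup(output, 2, 5) == target)     # BACKUP 2 MAKE-IT 5
    print(change(output, 9, 5) == target)     # CHANGE 9 MAKE-IT 5
    print(change([1, 9, 3, 9, 2], 9, 5, nth=2))  # CHANGE 9 SECOND-ONE MAKE-IT 5
```

Each partial method replaces a single digit, whereas RUBOUT requires the whole string to be spoken again, which is the trade-off the experiment examines.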
1.3 Objectives



The specific objectives of this research were as follows: 

(1) To examine the effects of three different input string lengths on
continuous ASR accuracy and efficiency.

(2) To examine the effects of four different correction methods on
continuous ASR efficiency.

(3) To examine any interaction effects of the three string lengths
with the four correction methods in terms of accuracy and
efficiency.

(4) To obtain an estimate of the recognition accuracy of a
currently available continuous ASR device.

(5) To examine the effects, if any, of gender on accuracy and
efficiency.
2. METHOD



2.1 Subjects 

Twelve volunteers were recruited primarily from the Naval Postgraduate
School in Monterey, CA. The six males included 4 Naval officers, 1 Marine
officer, and 1 National Reservist. The females included 4 secretaries and
1 elementary school teacher not associated with the Naval Postgraduate
School. One subject had worked with ASR for about 3 years. Three subjects
had about 3 hours of experience each with a discrete ASR system and the
remaining 8 subjects had never used an ASR system. Five subjects had
previous microphone experience as pilots, navigators, or radio operators.

2.2 Apparatus 

A Verbex 3000 continuous ASR system was used in this study. The system is
capable of recognizing natural continuous speech of indefinite length,
limited only by an output buffer of 240 characters per recognition set.

A Shure model SM12A headset microphone was used as the input device. This
microphone is supplied as standard equipment with the Verbex.

Prompts and recognition sets were displayed on a Lear Siegler ADM31 video
display terminal.

2.3 Experimental Design

This experiment employed a 5 x 3 x 2 mixed design. Five correction methods
were crossed with three input string lengths. The correction methods were
RUBOUT, POSITION, BACKUP, CHANGE, and ALL, in which the previous four
methods were all available. The input string lengths were 3, 5, and 7
digits. Two groups of subjects, 6 males and 6 females, constituted the
between-subjects variable, and experienced all combinations of correction
methods and input string length. A summary of the experimental design
appears in Figure 2-1.

2.4 Procedure 



2.4.1 Introduction. The experiment was divided into a training session
lasting 45 minutes and a test session of 45 minutes. Subjects signed up
for the individual 45 minute sessions at their convenience. Seven subjects
did the training and testing sessions on separate days no more than one
week apart.

The sessions took place in the lab at the Naval Postgraduate School. 
The Verbex was located in an 18 by 16 foot acoustically paneled room with
several other computer terminals and peripherals. During the course of any
session it was common to have several people talking and typing in the
room. Also located in the room was a heavy (lead shielded) pneumatically
operated sliding door. The opening and closing of this door produced
significant noise in the room, peaking at approximately 85 dBC, and as the
main entrance to the lab, it was opened and closed frequently. Although
the various sources of noise were considerable, no measures were taken to
reduce or control them. The resulting sound level environment ranged from
64 to 85 dBC with a mode of about 72 dBC.

At the beginning of each subject's first session the experimenter described
the experiment and gave a demonstration of how the continuous ASR system
would later be used by the subject for numerical data entry and error
correction.



FIGURE 2-1.

SUMMARY OF EXPERIMENTAL DESIGN



2.4.2 Training. After the demonstration the experimenter led the subject
through the training phase. The term "training," as used in ASR, refers to
the process by which the speaker makes known to the recognizer the
characteristics of his/her particular speech patterns for all the
utterances he/she will be using. Twenty utterances were used in the
current study (see Appendix A). For the Verbex 3000 this training
procedure consists of two phases, isolated and continuous. In the isolated
training phase the speaker says each utterance in the vocabulary at least
twice by itself (discrete or isolated). Isolated training was suspended
when the pneumatic door was activated since Verbex recommends isolated
training in a quiet environment.

In the continuous training phase up to three utterances are grouped
together and spoken continuously. Each utterance was included in about 20
such groups and was therefore coarticulated about 20 times during the
continuous training phase. Two hundred such groups of utterances were
spoken. Subjects were reminded to speak in a natural voice, but somewhat
more quickly than in normal conversation, since they would be speaking
rapidly in the subsequent test session. Continuous training proceeded
throughout noises from the pneumatic door and talking in the room. The
continuous training phase took approximately 25 minutes per subject.

As a result of these training phases the Verbex retains a template in 
memory on each utterance. Ideally, subsequent utterances (in testing) are 
matched with the template for the same utterance in memory, resulting in a 
correct recognition and output. In cases where a match is not found, a 
nonrecognition or rejection occurs and the Verbex makes no response. 
Occasionally, the recognizer makes an incorrect match and an incorrect
response is output, constituting a misrecognition or misinterpretation of
the utterance.



2.4.3 Testing. Before data collection each subject completed a practice
session. In the practice session a randomly generated five digit prompt
appeared in the upper right-hand corner of the display screen. The subject
spoke the digits and the recognition set was output directly below the
prompt. If the recognition was 100% correct, the screen cleared after 2.9
seconds and the process was repeated with a new prompt. If any part of the
output was incorrect the display remained the same until a correction
command was entered. The output string was then immediately modified to
reflect the correction. If no further corrections were necessary the
screen cleared and a new prompt appeared in 2.9 sec. All four correction
methods were available. The subject was to say "RESTART" whenever the
Verbex produced no output to the subject's input. An audible beep signaled
recognition of the "RESTART" command.

Subjects were reminded that in the test phase total input time would be
measured and were instructed to test the limits of the ASR system during
practice by speeding up their inputs until they resulted in output errors.
This gave the subjects a good estimate of how fast they could enter the
digits as well as the cost (in time) of correcting errors. Each of the
correction methods was practiced until the subjects demonstrated a clear
understanding of, and ability to quickly execute, all of them.

Once the subject completed practice and the experimenter answered any
questions the data collection began. The task was to input a total of 105
digits, with corrections, in as short a time as possible. Each subject
entered 105 digits (3 at a time, 5 at a time, and 7 at a time) under
each of the 5 correction conditions. The order of the four correction
methods RUBOUT, POSITION, BACKUP, and CHANGE was counterbalanced for
both sequential position and preceding condition (Bradley, 1976). The ALL
correction condition was always done last. For each subject, five of the
six possible string length sequences were chosen randomly and randomly
ordered across the five correction conditions.
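One common way to counterbalance four conditions for both sequential position and immediately preceding condition is a balanced (Williams-style) Latin square. The sketch below assumes that construction for illustration; the report does not list the actual orders used, and the function name is hypothetical.

```python
def williams_square(conditions):
    """Build a balanced Latin square for an even number of conditions.

    Each condition appears once in every ordinal position, and each
    condition is immediately preceded by every other condition exactly once.
    """
    n = len(conditions)
    if n % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # First row follows the index pattern 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Remaining rows shift every index by one, modulo n.
    return [[conditions[(i + shift) % n] for i in first] for shift in range(n)]


if __name__ == "__main__":
    methods = ["RUBOUT", "POSITION", "BACKUP", "CHANGE"]
    for row in williams_square(methods):
        # ALL is appended because it was always run last.
        print(row + ["ALL"])
```

With twelve subjects, each of the four rows could be assigned to three subjects; that assignment is an assumption, not something stated in the procedure.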



In testing, the prompt of 3, 5, or 7 digits appeared in the upper right
hand corner of the screen. The subject then said the digits. If the
system could not find a match or only "heard" a portion of the input, it
made no response, in which case the subject would say "RESTART," hear a
beep, and try again. If there was an error in the recognition output, the
output string was displayed directly below the prompt and remained there
until a correction command was recognized. The output string was
immediately modified to reflect the correction. If "RUBOUT" was used the
output string was replaced by dashes (- - -). Once the output string
matched the prompt (whether on the original input or after one or more
corrections) the screen was immediately cleared of both prompt and output
and a new prompt appeared. The process of recognition, accuracy checking,
screen clearing and presenting a new prompt took 0.8 seconds in all
conditions. This process was repeated until a total of 105 digits had been
correctly recognized. The experimenter timed each run and the computer (in
the Verbex) tracked and reported the number of errors (corrections and
nonrecognitions). The experimenter recorded the time and errors at the end
of each run.

After the subjects completed all the conditions they were asked if they 
preferred any correction method(s) over the others, and if there was any 
correction method they thought was either so useless or confusing that they 
would not even make it an option. Responses to these questions were 
recorded by the experimenter and the test phase was concluded. 

Independent and Dependent Variables 

The independent variables in this study were input string length (3, 5, 7);
error correction method (RUBOUT, POSITION, BACKUP, CHANGE, ALL); and sex.
The dependent variables were efficiency (time to input 105 digits
correctly) and recognition accuracy.



3. RESULTS



3.1 Overview 

For error data all analyses of variance procedures and post hoc range tests
were performed using the arcsin transformation of raw data to stabilize the
variance of the error terms (Neter and Wasserman, 1974). The mean error
and time rates that appear in the tables and figures are untransformed.
All a posteriori tests for significance between pairs of means were
performed using the Scheffe procedures described in Bruning and Kintz
(1977).
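A variance-stabilizing transformation for proportions is often written as the arcsine of the square root of the proportion. The sketch below assumes that form; the report says only "arcsin transformation," so the exact variant used (for example, whether the result was doubled) is not confirmed, and the function name is hypothetical.

```python
import math


def arcsine_transform(proportion: float) -> float:
    """Return asin(sqrt(p)) in radians for a proportion p in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(proportion))


if __name__ == "__main__":
    # Illustrative error rates expressed as proportions, not raw study data.
    for error_rate in (0.02, 0.04, 0.07):
        print(f"{error_rate:.2f} -> {arcsine_transform(error_rate):.4f} rad")
```

The transformation stretches values near 0 and 1, which is why it is commonly applied before analysis of variance when error rates are small, as they are here.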

Section 3.2 presents the data on efficiency (time to correctly recognize
105 digits). Section 3.3 presents data on total errors. Section 3.4
presents data on subjects' responses to post-test questions.

3.2 Efficiency 

Table 3-1 presents the analysis of variance for efficiency (time to
correctly recognize 105 digits). A significant main effect of input string
length was found (F = 25.059, p < .001). No other main effects or
interactions were significant. Mean total times (in seconds) for input
string length by correction method are shown in Table 3-2. The main effect
of input string length is shown graphically in Figure 3-1.

Scheffe tests were performed to detect simple effects between input string
lengths. Inputting digits seven at a time was significantly more efficient
than both five at a time and three at a time, at the p < .05 level.
Inputting 5 digits at a time was significantly more efficient than 3 at a
time at the p < .1 level.
TABLE 3-1

ANALYSIS OF VARIANCE SUMMARY TABLE
OF EFFICIENCY

SOURCE                       df          MS           F

GENDER (G)                    1      2191.022        .442
ERROR                        10      4958.616

CORRECTION METHOD (C)         4       635.506        .874
C x G                         4       764.106       1.051
ERROR                        40       727.249

INPUT STRING LENGTH (L)       2     11242.839      25.059 *
L x G                         2       134.372        .300
ERROR                        20       448.652

L x C                         8       436.131        .899
L x C x G                     8       304.164        .627
ERROR                        80       485.186

* p < .001
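Each F value in Table 3-1 is the ratio of an effect's mean square to the mean square of the error term listed beneath it. The sketch below recomputes those ratios from the tabled mean squares as a consistency check; the variable names are hypothetical and the reported values are entered as read from the table.

```python
def f_ratio(ms_effect: float, ms_error: float) -> float:
    """F statistic as the effect mean square over its error mean square."""
    return ms_effect / ms_error


# (source, MS effect, MS error, F as reported in Table 3-1)
EFFICIENCY_ROWS = [
    ("G", 2191.022, 4958.616, 0.442),
    ("C", 635.506, 727.249, 0.874),
    ("C x G", 764.106, 727.249, 1.051),
    ("L", 11242.839, 448.652, 25.059),
    ("L x G", 134.372, 448.652, 0.300),
    ("L x C", 436.131, 485.186, 0.899),
    ("L x C x G", 304.164, 485.186, 0.627),
]

if __name__ == "__main__":
    for source, ms, ms_err, reported in EFFICIENCY_ROWS:
        print(f"{source:10s} F = {f_ratio(ms, ms_err):7.3f} (reported {reported:.3f})")
```

Every recomputed ratio agrees with the reported F to within rounding, which shows how the between-subjects and within-subjects effects are each tested against their own error term.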
TABLE 3-2

MEAN TOTAL TIME (IN SECONDS) FOR
INPUT STRING LENGTH BY CORRECTION METHOD

                              INPUT STRING LENGTH
CORRECTION METHOD            3         5         7       MEAN

RUBOUT                    107.67     88.50     73.08     89.75
POSITION                   98.42     80.58     76.50     85.17
BACK-UP                   102.08    105.67     82.00     96.58
CHANGE                    105.50    100.92     72.50     92.97
ALL                       102.83     93.42     77.50     91.25

MEAN                      103.30     93.82     76.32     91.14 (GRAND MEAN)
FIGURE 3-1.

MEAN TIME (IN SECONDS) BY INPUT STRING LENGTH
(Vertical axis: time in seconds. Horizontal axis: input string length.)
3.3 Total Errors

Table 3-3 presents the analysis of variance for total errors. A
significant main effect of input string length was found (F = 6.446, p <
.01). No other main effects or interactions were significant. Mean total
errors for input string length by correction method are shown in Table 3-4.
The main effect of input string length is shown graphically in Figure 3-2.
Scheffe tests were performed to detect simple effects between input string
lengths. Inputting digits seven at a time resulted in significantly fewer
errors than when digits were input 3 at a time or 5 at a time (p < .05).
However, the difference in errors resulting from inputting 3 digits at a
time versus 5 digits at a time was statistically non-significant (p > .25).

Table 3-5 presents the results of subjects' choice of correction methods in
the ALL condition and responses to the post-test questions. Subjects
reported favoring CHANGE the most and BACKUP the least. However, in the
correction condition in which all correction methods were available, CHANGE
was used most and RUBOUT was used least.



TABLE 3-3

ANALYSIS OF VARIANCE SUMMARY TABLE
BY TOTAL ERRORS

SOURCE                       df          MS           F

GENDER (G)                    1        .12552        .474
ERROR                        10        .26469

CORRECTION METHOD (C)         4        .00389        .093
C x G                         4        .05264       1.264
ERROR                        40        .04165

INPUT STRING LENGTH (L)       2        .15870       6.446 *
L x G                         2        .00527        .232
ERROR                        20        .06462

L x C                         8        .02829        .849
L x C x G                     8        .01521        .457
ERROR                        80        .03331

* p < .01
TABLE 3-4

MEAN TOTAL ERRORS (IN PERCENT) FOR
INPUT STRING LENGTH BY CORRECTION METHOD

                              INPUT STRING LENGTH
CORRECTION METHOD            3         5         7       MEAN

RUBOUT                     4.545     3.979     2.315     3.613
POSITION                   4.307     3.837     3.523     3.889
BACK-UP                    4.121     6.839     4.262     5.074
CHANGE                     4.101     6.662     2.559     4.441
ALL                        4.631     5.394     3.772     4.599

MEAN                       4.341     5.342     3.286     4.323 (GRAND MEAN)
FIGURE 3-2.

MEAN TOTAL ERRORS (IN PERCENT) BY INPUT STRING LENGTH 
TABLE 3-5

CHOICE OF CORRECTION METHODS AND
RESPONSE TO POST-TEST QUESTIONS

33% WOULD NOT OMIT ANY METHOD



4. DISCUSSION 



This section will discuss the current findings with regard to the 
objectives put forth earlier in this report. 

4.1 Effects of Input String Length 

In terms of both accuracy and efficiency the results clearly demonstrated
the advantages of inputting digits seven at a time compared to three or
five at a time. This superior efficiency is probably a function of the
relatively low number of both interstimulus pauses and errors associated
with the longer input string length. With an interstimulus pause of .8
seconds, error free pause times using the input string lengths of three,
five, and seven were 28 seconds, 16.8 seconds, and 12 seconds,
respectively. Consideration of the interstimulus pause reveals some
noteworthy facts. If these pause times are deducted from the respective
condition means, the differences among the resulting times are
substantially reduced and in the case of input lengths three versus five,
the direction of the difference is reversed (see Figure 4-1). However, the
input string length of seven remains the fastest even if interstimulus
pauses are eliminated completely; therefore, time is not solely a function
of number of interstimulus pauses. Rather, time is a function of both
number of interstimulus pauses and number of errors to correct.

Revised efficiency (time to input 105 digits minus error free interstimulus
pause time) appears to be primarily a function of error rate. This
supposition is supported by the data presented in Figure 4-2, which relates
revised efficiency to error rate, by input string length.
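The pause-time adjustment can be reproduced from the 0.8 second interstimulus pause, the 105 digit total, and the string-length means in Table 3-2. The sketch below is a minimal illustration under the assumption that one pause follows every string; the names are hypothetical.

```python
PAUSE_SECONDS = 0.8
TOTAL_DIGITS = 105

# Mean total time in seconds by input string length, from Table 3-2.
MEAN_TOTAL_TIME = {3: 103.30, 5: 93.82, 7: 76.32}


def pause_time(string_length: int) -> float:
    """Error free pause time: one pause per string entered."""
    return (TOTAL_DIGITS // string_length) * PAUSE_SECONDS


def revised_efficiency(string_length: int) -> float:
    """Mean total time minus error free interstimulus pause time."""
    return MEAN_TOTAL_TIME[string_length] - pause_time(string_length)


if __name__ == "__main__":
    for length in (3, 5, 7):
        print(f"length {length}: pause {pause_time(length):.1f} s, "
              f"revised {revised_efficiency(length):.2f} s")
```

The revised times come to about 75.3, 77.0, and 64.3 seconds for strings of three, five, and seven digits, so three becomes slightly faster than five while seven remains fastest.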
FIGURE 4-1.

TOTAL TIME AND TOTAL TIME MINUS INTERSTIMULUS
PAUSE TIME BY INPUT STRING LENGTH
(Vertical axis: time in seconds. Horizontal axis: input string length.
Legend: total time; total time minus interstimulus pause time.)
FIGURE 4-2.

REVISED EFFICIENCY AND ERROR BY INPUT STRING LENGTH
(Vertical axis: revised efficiency in seconds. Horizontal axis: input
string length.)
4.2 Effects of Correction Methods



There were no significant differences in accuracy or efficiency as a result
of the various correction methods, and correction method did not interact
with string length or gender. It was the experimenter's observation that
the vast majority of errors consisted of one misrecognized digit in the
spoken input string. Based on this observation one might expect the RUBOUT
method to reduce efficiency, especially with the string length of seven,
since the entire string had to be repeated after the correction command was
spoken, and since this correction process involves two interstimulus
pauses compared to only one in all other correction methods. Hindsight and
statements made by subjects provide some feasible explanations for the
absence of this outcome:

(1) In using the RUBOUT method the subject did not have to search
the output string for the specific error, determine its
position or identity, and plug this information into the
correction command format. This may have given RUBOUT a speed
advantage over the other three methods which required the
subject to perform additional mental processing and verbal
formatting.

(2) A floor effect cannot be ruled out since there were very few
errors under all conditions (grand mean = 4.32% and RUBOUT
mean = 3.61%). As a result, subjects had an opportunity to
implement a correction method an average of only four or five
times per condition.
Subjects preferred the CHANGE method most and BACKUP the least and used the
CHANGE method more than twice as often as any other method when given a
choice of all four methods in the ALL condition. Although statistically
condition, which by far, the most subjects chose as the correction they 
would omit in a numerical data entry task. 

4.3 Estimate of Recognition of a Continuous ASR Device 

The recognition accuracy of the ASR system averaged 95.68%. In many ways
this was a conservative estimate of the system's capabilities:

(1) While the system has the capability to adjust its gain level
to speech versus background noise, this setting remains
constant throughout training and testing. Therefore, sporadic
noise changes such as those caused by the pneumatic door
opening and closing and the voices and typing of additional
individuals entering (or leaving) the room are not
accommodated by the gain level, and present a formidable
challenge to the ASR device.

(2) Subjects were instructed to speak more rapidly than in normal
conversational speed, increasing the degree of coarticulation
and, presumably, making the task of speech processing more
difficult than usual.

(3) Subjects spent only 20 minutes actually providing speech for
template creation and, unlike many previous studies, "problem"
words (words often confused with other words) were not
retrained to an improved recognition criterion (Poock &
Martin, 1983; Poock, Schwalm, Martin, and Roland, 1982).



(4) One error was recorded for each nonrecognition or correction
made. In some cases, one or two errors may require several
corrections. For example, the input string "1, 2, 3, 4, 5, 6,
7" is spoken but the "1" is not recognized, constituting the
first error. The ASR device now has "2, 3, 4, 5, 6, 7" in its
recognition buffer and erroneously takes the next input
(RESTART or background noise) as the seventh digit,
constituting a second error. The resulting output buffer is
"2, 3, 4, 5, 6, 7, 0" and in three of the four correction
methods the subject had to make seven corrections to correct
the entire string (e.g., POSITION 1 MAKE-IT 1, POSITION 3
MAKE-IT 3, BACKUP 1 MAKE-IT 7, CHANGE 5 MAKE-IT 4, etc.). As
a result, one error of omission and one error of insertion
lead to seven corrections, and would be recorded as seven
errors rather than as two.
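The gap between corrections counted and underlying errors in this example can be illustrated by comparing a position-by-position mismatch count with an edit distance that allows insertions and deletions. This is a sketch for illustration only; the Verbex error counter is not described in that detail, the function names are hypothetical, and the trailing 0 in the output is an assumed reading of the seventh digit.

```python
def position_mismatches(prompt, output):
    """Digits that differ position by position, one correction each."""
    return sum(p != o for p, o in zip(prompt, output))


def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    prompt = [1, 2, 3, 4, 5, 6, 7]
    output = [2, 3, 4, 5, 6, 7, 0]  # "1" dropped, spurious digit appended
    print("corrections needed:", position_mismatches(prompt, output))
    print("underlying errors: ", edit_distance(prompt, output))
```

For this string the position-wise count is seven while the edit distance is two, matching the one omission and one insertion described above.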

Finally, one factor should be noted that may have worked in favor of the
ASR system. The vocabulary size of only 20 utterances is relatively small
and the branching complexity of the grammar structure was fairly simple
(see Figure 4-3).

4.4 Effects of Gender 

As expected, gender did not significantly affect either accuracy or
efficiency. This supports the findings of Batchellor (1981).
.digit .digit .digit .digit .digit .digit .digit

   OR

RESTART

   OR

RUBOUT

   OR

POSITION .place MAKE-IT .digit

   OR

BACKUP .place MAKE-IT .digit

   OR

CHANGE .digit <.whichone> MAKE-IT .digit


.digit =       .place =       .whichone =

ZERO           ONE            FIRST-ONE
ONE            TWO            SECOND-ONE
TWO            THREE          THIRD-ONE
THREE          FOUR
FOUR           FIVE
FIVE           SIX
SIX            SEVEN
SEVEN
EIGHT
NINE
OH


FIGURE 4-3.

GRAMMAR STRUCTURE OF NUMERICAL DATA ENTRY VOCABULARY
FOR SEVEN DIGIT STRING
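The grammar in Figure 4-3 is small enough to express as a single regular expression over the vocabulary words. The sketch below is a hypothetical validator built from the figure, not the Verbex grammar format; the function names are assumptions, and it treats utterances as space-separated uppercase words.

```python
import re

DIGIT = r"(?:ZERO|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|OH)"
PLACE = r"(?:ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN)"
WHICH = r"(?:FIRST-ONE|SECOND-ONE|THIRD-ONE)"


def build_grammar(string_length: int) -> "re.Pattern[str]":
    """Compile the allowed utterances for a given digit string length."""
    digit_string = r"\s+".join([DIGIT] * string_length)
    alternatives = [
        digit_string,
        "RESTART",
        "RUBOUT",
        rf"POSITION\s+{PLACE}\s+MAKE-IT\s+{DIGIT}",
        rf"BACKUP\s+{PLACE}\s+MAKE-IT\s+{DIGIT}",
        # The which-one word is optional, as in "CHANGE 9 MAKE-IT 5".
        rf"CHANGE\s+{DIGIT}(?:\s+{WHICH})?\s+MAKE-IT\s+{DIGIT}",
    ]
    return re.compile(r"^(?:" + "|".join(alternatives) + r")$")


def is_valid(utterance: str, string_length: int = 7) -> bool:
    """True if the utterance is one of the forms the grammar allows."""
    return build_grammar(string_length).match(utterance.strip()) is not None


if __name__ == "__main__":
    print(is_valid("ONE FOUR THREE FIVE TWO", string_length=5))
    print(is_valid("CHANGE NINE SECOND-ONE MAKE-IT FIVE"))
    print(is_valid("POSITION NINE MAKE-IT FIVE"))  # NINE is not a .place word
```

Because only six utterance forms are allowed at any point, the recognizer's search space stays small, which is the low branching complexity noted in Section 4.3.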



5. CONCLUSION



This exploratory study provided interesting and useful findings. The
spoken entry of seven digits at a time proved significantly more efficient
than shorter strings of digits. This effect prevailed despite the greater
speech processing required by the Verbex, and the added processing imposed
on the subject in repeating, checking, and correcting the seven digit
strings versus the shorter input strings of three and five. The reason for
such an outcome is currently unknown. The investigators' only speculation
is that the longer string (with the resulting fewer pauses) was less prone
to errors caused by the peak noises of the loud pneumatic door. Future
research is suggested to test this speculation by repeating the basic
experiment in a consistent sound level environment. In the meantime, input
strings of seven digits are recommended for numerical data entry because of
their efficiency in terms of error rate and minimal inter-input pauses. If
input string length does interact with peak background noise, the use of
seven digit strings becomes even more attractive.

Since no effects were associated with correction method, we suggest
including CHANGE, RUBOUT, and POSITION as options for numerical data
correction. BACKUP is deleted on the basis of high subject disapproval
and low use.

The average recognition accuracy rate of over 95% is conservative and
promising. We believe the outlook for ASR, especially in numerical data
entry, is greatly improved with the advent of reliable continuous speech
recognition capabilities such as those now available in the model used in
our tests.
Appendix A

Numerical Data Entry Vocabulary



1. ZERO
2. ONE
3. TWO
4. THREE
5. FOUR
6. FIVE
7. SIX
8. SEVEN
9. EIGHT
10. NINE
11. OH
12. RUBOUT
13. RESTART
14. POSITION
15. BACKUP
16. CHANGE
17. FIRST-ONE
18. SECOND-ONE
19. THIRD-ONE
20. MAKE-IT